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Using the VBARMS method in parallel computing 

Bruno Carpentieri* Jia Liao* Masha Sosonkina' Alclo Bonfigliolh 


Abstract 

The paper describes an improved parallel MPI-based implementation of VBARMS, a variable 
block variant of the pARMS preconditioner proposed by Li, Saad and Sosonkina [NLAA, 2003] for 
solving general nonsymmetric linear systems. The parallel VBARMS solver can detect automatically 
exact or approximate dense structures in the linear system, and exploits this information to achieve 
improved reliability and increased throughput during the factorization. A novel graph compression 
algorithm is discussed that finds these approximate dense blocks structures and requires only one 
simple to use parameter. A complete study of the numerical and parallel performance of parallel 
VBARMS is presented for the analysis of large turbulent Navier-Stokes equations on a suite of three- 
dimensional test cases. 

Keywords: Linear systems, incomplete LU factorization preconditioners, graph compression 
techniques, parallel performance, distributed-memory computers. 


1 Introduction 


The initial motivation for this study is the design of robust preconditioning techniques for solving sparse 
block structured linear systems arising from the finite element / finite volume analysis of turbulent flows 
in computational fluid dynamics applications. Over the last few years we have developed block multilevel 
incomplete LU (ILU) factorization methods for this problem class, and we have found them very effective 
in reducing the number of GMRES iterations compared to their pointwise analogues [8]. This class of 
preconditioners can offer higher parallelism and robustness than standard ILU algorithms especially for 
solving large problems, thanks to their multilevel mechanism. Exploiting existing block structures in 
the matrix can help reduce numerical instabilities during the factorization and achieve higher flops to 
memory ratios on modern cache-based computer architectures. Sparse matrices arising from the solution 
of systems of partial differential equations often exhibit perfect block structures consisting of fully dense 
(typically small) nonzero blocks in their sparsity pattern, e.g., when several unknown physical quantities 
are associated with the same grid point. For example, a plane elasticity problem has both x- and y- 
displacements at each grid point; a Navier-Stokes system for turbulent compressible flows would have 
five distinct variables (the density, the scaled energy, two components of the scaled velocity, and the 
turbulence transport variable) assigned to each node of the physical mesh; a bidomain system in cardiac 
electrical dynamics couples the intra-and extra-cellular electric potential at each ventricular cell of the 
heart. Upon numbering consecutively the t distinct variables associated with the same grid point, the 
permuted matrix has a sparse block structure with nonzero blocks of size i x i. The blocks are fully dense 
if variables at the same node are mutually coupled. 
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Our recently developed variable block algebraic recursive multilevel solver (shortly, VBARMS) can detect 
fine-grained dense structures in the linear system automatically, without any user’s knowledge of the 
underlying problem, and exploit them efficiently during the factorization [8], Preliminary experiments 
with a parallel MPI-based implementation of VBARMS for distributed memory computers, presented 
in a conference contribution [7 , showed the robustness of the proposed method for solving some larger 
matrix problems arising in different fields. In this paper, capitalizing on those results, we introduce a 
new graph-based compression algorithm to construct the block ordering in VBARMS, which extends the 
method proposed by Ashcraft in [I] and requires only one simple to use parameter (Section^; we describe 
in Section [3] a novel implementation of the block partial factorization step that proves to be noticeably 
faster than the original one presented in [81; finally, in Section |4j we assess the parallel performance of 
our parallel VBARMS code for solving turbulent Navier-Stokes equations in fully coupled form on large 
realistic three-dimensional meshes; in the new parallel implementation, we use a parallel graph partitioner 
to reduce the graph partitioning time significantly compared to the experiments presented in 7 . 


2 Graph compression techniques 


It is known that block iterative methods often show faster convergence rate than their pointwise analogues 
in the solution of many classes of two- and three-dimensional partial differential equations (PDEs). 
When the domain is discretized by cartesian grids, a regular partition may also provide an effective 
matrix partitioning. For example, in the case of the simple Poisson’s equation with Dirichlet boundary 
conditions, defined on a rectangle (0,£i) x (0, £ 2 ) discretized uniformly by n\ + 2 points in the interval 
(0, £ 1 ) and 71-2 + 2 points in (0, (.%), upon numbering the interior points in the natural ordering by lines 
from the bottom up, one obtains a ri 2 x 72,2 block tridiagonal matrix with square blocks of size ni x m; 
the diagonal blocks are tridiagonal matrices and the off-diagonal blocks are diagonal matrices. For large 
finite element discretizations, it is common to use substructuring, where each substructure of the physical 
mesh corresponds to one sparse block of the system. However, if the domain is highly irregular or the 
matrix does not correspond to a differential equation, finding the best block partitioning is much less 
obvious. In this case, graph reordering techniques are worth considering. 

The PArameterized BLock Ordering (PABLO) method proposed by O’Neil and Szyld is one of the 
first matrix partitioning algorithms specifically designed for block iterative solvers 15 . The algorithm 


selects groups of nodes in the adjacency graph of the coefficient matrix such that the corresponding 
diagonal blocks are either full or very dense. It has been shown that classical block stationary iterative 
methods such as block Gauss-Seidel and SOR methods combined with the PABLO ordering require fewer 
operations than their point analogues for the the finite element discretization of a Dirichlet problem on 
a graded L-shaped region, as well as on the 9-point discretization of the Laplacian operator on a square 
grid. The complexity of the PABLO algorithm is proportional to the number of nodes and edges in both 
time and space. 

Another useful approach for blocking a matrix A is to find block independent sets in the adjacency graph 
of A 21 . A block independent set is defined as a set of groups of nodes (or unknowns) having the property 
that there is no coupling between nodes of any two different groups, while nodes within the same group 
may be coupled. Independent sets of unknowns in a linear system can be eliminated simultaneously at 
a given stage of Gaussian Elimination. For this reason, this type of oredering is extensively adopted 
in linear solvers design. Independent sets may be computed by using simple graph algorithms which 
traverse the vertices of the adjacency graph of A in the natural order 1,2,... ,n, mark each visited vertex 
v and all of its nearest neighbors connected to v by an edge, and add v and each visited node that is 
not already marked to the current independent set partition 18 . Upon renumbering nodes one partition 


after the other, followed as last by interface nodes straddling between separate partitions, one obtain a 
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permutation of A in the form 


PAP 1 = 


D F 
E C 


( 1 ) 


where D is a block diagonal matrix. The nested dissection ordering by George 10 , mesh partitioning, 
or further information from the set of nested finite element grids of the underlying problem J2,3,61 can 
be used as an alternative to the greedy independent set algorithm described above. Additionally, the 


numerical values of A may be incorporated in the ordering to produce more robust factorizations 21 


However, finite element and finite difference matrices often possess also fine-grained block structures that 
can be exploited in iterative solvers. If there is more than one solution component at a grid point, the 
corresponding matrix entries may form a small dense block and optimized codes can be used for dense 
factorizations in the construction of the preconditioner and dense matrix-vector products in the sparse 
matrix-vector product operation for better performance, see e.g. [8j[II][l9|[23j[24]. A block incomplete 
LU factorization (ILU) method is one preconditioning technique that treats small dense submatrices 
of A as single entities, and the VBARMS method discussed in this paper can be seen as its natural 
multilevel generalization. An important advantage of block ILU versus conventional ILU is the potential 
gain obtained from using optimized level 3 basic linear algebra subroutines (BLAS3). Column indices and 
pointers can be saved by storing the matrix as a collection of blocks using the variable block compressed 
sparse row (VBCSR) format, where each value in the CSR format is a dense array. On indefinite problems, 
computing with blocks instead of single elements enables us a better control of pivot breakdowns, near 
singularities, and other sources of numerical instabilities. These facts have been assessed in our previous 
contribution |(8:. 

The method proposed by Ashcraft in 1 is one of the first compression techniques for finding dense blocks 
in the sparsity pattern of a matrix. The algorithm searches for sets of rows or columns having the exact 
same pattern. From a graph viewpoint, it looks for vertices of the adjacency graph (V,E) of A having 
the same adjacency list. These are also called indistinguishable nodes or cliques. The algorithm assigns 
a checksum quantity to each vertex, e.g., using the function 


chk(u) = 


(u,w)£E 


W, 


( 2 ) 


and then sorts the vertices by their checksums. This operation takes \E\ + |Vj log |Vj time. If u and v 
are indistinguishable, then chk(u) = chk{v). Therefore, the algorithm examine nodes having the same 
checksum to see if they are indistinguishable. The ideal checksum function would assign a different value 
for each different row pattern that occurs but it is not practical because it may quickly lead to huge 
numbers that may not even be machine-representable. Since the time cost required by Ashcraft’s method 
is generally negligible relative to the time it takes to solve the system, simple checksum functions such 
as ([2]) are used in practice [l]. 

Sparse unstructured matrices may sometimes exhibit approximate dense blocks consisting mostly of 
nonzero entries except only a few zeros inside the blocks. By treating these few zeros as nonzero elements, 
with a little sacrifice of memory, a block ordering may be generated for an iterative solver. Computing 
approximate dense structures may enable us to enlarge existing blocks and to use BLAS3 operations 
more efficiently in the iterative solution, but it may also increase the memory costs and the probability to 
encounter singular blocks during the factorization [8j. Two important performance measures to gauge the 
quality of the block ordering computed are the average block density (avJod) value, defined as the amount 
of nonzeros in the matrix divided by the amount of elements in the nonzero blocks, and the average block 
size ( avJ)s ) value, which is the ratio between the sum of dimensions of the square diagonal blocks divided 
by the number of diagonal blocks. From our computational experience, high average block density values 
around 90% are necessary to prevent the occurrence of singular blocks during the factorization. 
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2.1 The angle-based method 

Approximate dense blocks in a matrix may be computed by numbering consecutively rows and columns 
having a similar nonzero structure. However, this would require a new checksum function that preserves 
the proximity of patterns, in the sense that close patterns would result in close checksum values. 
Unfortunately, this property does not hold true for Ashcraft’s algorithm in its original form. In Jl9] , 
Saad proposed to compare angles of rows (or columns) to compute approximate dense structures in a 
matrix A. Let C be the pattern matrix of A , which by definition has the same pattern as A and nonzero 
values equal to one. The method proposed by Saad computes the upper triangular part of CC T . Entry 
(*, j) is the inner product (the cosine value) between row i and row j of C for j > i. A parameter r is used 
to gauge the proximity of row patterns. If the cosine of the angle between rows i and j is smaller than 
r, row j is added to the group of row i. For r = 1 the method will compute perfectly dense blocks, while 
for r < 1 it may compute larger blocks where some zero entries are padded in the pattern. To speed up 
the search, it may be convenient to run a first pass with the checksum algorithm to detect rows having 
an identical pattern, and group them together; then, in a second pass, each non-assigned row is scanned 
again to determine whether it can be added to an existing group. In practice, however, it may be difficult 
to predict the average block density obtained using a given value of r . For example, the experiments 
reported in Table [l] show that r = 0.58 returns a block density of 86.37% for the VENKAT01 matrix and 
of 45.06% for the STACOM matrix. 


Matrix 

r = 0.56 

r = 0.57 

r = 0.58 

r = 0.59 

r = 0.60 

STACOM 

25.63 

25.68 

45.06 

50.83 

52.02 

K3PLATES 

37.78 

38.73 

58.62 

58.70 

59.16 

OILPAN 

50.08 

50.09 

50.23 

50.23 

90.65 

VENKAT01 

29.71 

29.71 

86.37 

86.37 

86.37 

RAE 

26.40 

26.48 

49.48 

50.71 

51.96 


Matrix 

t = 0.64 

r = 0.65 

r = 0.66 

r = 0.67 

t = 0.68 

RAEFSKY3 

63.32 

63.32 

63.32 

95.23 

95.23 

BMW7ST_1 

49.29 

50.11 

50.66 

68.85 

74.00 

S3DKQ4M2 

64.29 

64.29 

64.29 

97.52 

97.52 

PWTK 

57.05 

57.31 

57.48 

94.23 

94.75 


Table 1: Average block density value (%) obtained from the angle compression algorithm for different 
values of r. 

The cost of Saad’s method is closer to that of checksum-based methods for cases in which a good 
blocking already exists, and in most cases it remains inferior to the cost of the least expensive block 
LU factorization, i.e., block ILU(O). 

2.2 Graph-based compression 

We revisited Saad’s angle-based method to develop a new compression algorithm that computes a block 
ordering having an average block density avJbd, not smaller than a user-specified value /r. This may 
simplify the parameter selection procedure. The method proceeds in two steps. First, using the checksum 
algorithm it groups rows having equal nonzero structure and builds the quotient graph G/B = (Vb,Eb). 
In G/B , nodes corresponding to rows with identical pattern are coalesced into one single node of Ug 
(also called supernode or supervertex). An edge connects supervertices Y and Z of Ug if there exists 
an edge in G = (U, E ) connecting a vertex in Y to a vertex in Z. If A is unsymmetric, we assume to 
operate on the symmetrized graph of A + A T ] thus the edge orientation is not important. Afterwards, the 
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algorithm merges pairs of supernodes (Y, Z), for Z adjacent to Y in G/B, provided that the average block 
density value avJ)d of the new block ordering after this operation does not drop below [i. Otherwise, the 
algorithm will stop to prevent near-singularities during the block factorization. The total size of the rows 
and columns spanned by this new block is 

T = 2 • | adj(Y) U adj(X)\ ■ \ Y U X\ - \Y U X| 2 , 

which is the amount of nonzero rows and columns times the size of the supernode minus the square block 
on the diagonal which we count twice since we count both columns and rows. The nonzeros spanned by 
the new block is 

N = 2- Y, \ ad i( Z )\~ E \adj(Z)n(YUX)\, 

ZeYUX ZEYUX 

which is the amount of adjacent nodes per node inside the supernode minus the amount of nodes inside 
the diagonal block, which is again counted twice. The complete graph-based algorithm is sketched in 
Algorithm [I] It requires only one simple to use parameter /. i. If we desire a block ordering having an 
average block density around 60%, we simply set /i = 0.6. In contrast, a correct tuning of r may require 
to run the full solver to see if a singular block is encountered during the factorization. 

2.3 Experiments 

In Table [3] we give some comparative performance figure to show the viability of the graph algorithm. 
In our runs we attempted to find the optimal value of r by trial and error. By optimal value we mean 
the one that minimizes the number of GMRES iterations required to reduce the initial residual by 6 
orders of magnitude using a standard block incomplete LU factorization as a preconditioner for GMRES. 
The optimal value for the parameter r was calculated by running the angle algorithm with different 
r £ [0.5,1.0], by increments of 0.1 at every run. The results evidence the difficulty to compute a unique 
value which is nearly optimal for every problem. On the other hand, for the graph method we set /r = 0.7 
which gave us a minimum block density of 70% for every matrix. We see that the new compression 
algorithm is very competitive and additionally may be simple to use. In Tables [3] we also report on 
the timing to compute the block ordering by both compression techniques, and for solving the linear 
system. The new graph algorithm is in most cases up to three times slower than the angle algorithm. 
However, this is not a big downside because the compression time is considerably smaller than the total 
solution time, and computing the optimal value of r may require several runs as we explained. Clearly, 
the compression time increases when fj, decreases since we merge more supernodes in this circumstance. 
By the way, both compression methods helped reduce iterations. Without blocking, no convergence was 
achieved in 1000 iterations using pointwise ILUT on the OILPAN, K3PLATES, S3DKQ4M2, OLAFU, 
RAE, NASASRB, CT20STIF, RAEFSKY3, BCSSTK35, STACOM problems at equal or higher memory 
usage. On the other hand, no evident gain was observed from using level-2 BLAS routines in the sparse 
matrix-vector product operation, probably due to the small block size. 


Matrix 

Method 

t/h 

avJ)d (%) 

avJjs 

Blocking 
time (s) 

Solving 
time (s) 

Mem 

Its 

OTT E> A 1\I 

Angle 

0.70 

95.94 

7.36 

0.03 

4.18 

0.26 

198 


Graph 

0.70 

95.02 

7.42 

0.08 

4.17 

0.27 

198 

TTQDT A'TTT'Q 

Angle 

0.60 

59.16 

7.90 

0.00 

0.7 

0.3 

239 

JCVOJT Lj 1 Y 1 EL/D 

Graph 

0.70 

89.50 

5.65 

0.01 

0.7 

0.18 

241 

A7TP1\TE" ATH1 

Angle 

0.70 

99.94 

4.00 

0.02 

0.43 

1.33 

9 


Graph 

0.70 

94.05 

4.28 

0.08 

0.48 

1.58 

9 

DW'TTZ 

Angle 

0.60 

56.95 

12.17 

0.09 

26.38 

6.85 

117 


Graph 

0.70 

78.16 

7.31 

0.35 

32.64 

4.5 

137 
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S3DKQ4M2 

Angle 

1.00 

100.00 

5.93 

0.03 

9.57 

1.09 

214 

Graph 

0.70 

77.92 

7.81 

0.12 

15.1 

1.42 

309 

OLAFU 

Angle 

0.80 

81.75 

6.47 

0.02 

1.2 

3.14 

54 

Graph 

0.70 

79.66 

6.58 

0.11 

1.63 

3.75 

57 

RAE 

Angle 

0.80 

95.83 

4.67 

0.03 

8.85 

9.53 

49 

Graph 

0.70 

86.21 

4.64 

0.13 

15.74 

13.8 

42 

BMW7ST.1 

Angle 

0.70 

77.16 

7.28 

0.08 

0.35 

0.18 

5 

Graph 

0.70 

79.54 

6.65 

0.29 

0.48 

0.17 

9 

NASASRB 

Angle 

0.80 

90.87 

4.24 

0.05 

7.51 

5.23 

30 

Graph 

0.70 

77.62 

4.20 

0.20 

12.39 

7.46 

16 

CT20STIF 

Angle 

0.70 

66.05 

6.55 

0.04 

0.69 

0.18 

44 

Graph 

0.70 

78.42 

4.76 

0.16 

1.18 

0.14 

56 

RAEFSKY3 

Angle 

0.70 

95.23 

8.63 

0.01 

0.08 

0.13 

13 

Graph 

0.70 

77.67 

10.56 

0.02 

0.09 

0.17 

15 

HEART1 

Angle 

0.90 

98.81 

18.62 

0.00 

0.5 

0.78 

151 

Graph 

0.70 

0.00 

0.00 

0.00 

- 

- 

- 

BCSSTK35 

Angle 

0.60 

51.95 

11.03 

0.01 

2.1 

0.29 

209 

Graph 

0.70 

78.72 

6.57 

0.05 

2.66 

0.18 

235 

STACOM 

Angle 

0.90 

97.00 

4.36 

0.00 

0.25 

5.19 

31 

Graph 

0.70 

84.51 

4.47 

0.01 

0.29 

5.65 

33 


Table 3: Experiments with the angle-based and the graph-based 
compression methods. The optimal value of r is used for the angle- 
based algorithm. The value /.i = 0.7 is used for the graph-based 
algorithm in all our runs. 


For the sake of comparison, we also ran some experiments using the PABLO algorithm introduced by 
O’Neil and Szyld in [15] , in combination with block incomplete LU factorization preconditioning. The 
convergence results are reported in Table[4j and a comparison of patterns produced by the two compression 
techniques is shown in Figure[l]for two matrices. We observe that the block ordering computed by PABLO 
may produce larger blocks compared to the graph and angle methods. However, the average block size 
can be significantly smaller, probably due to the design philosophy of PABLO that attempts to maximize 
the density of the diagonal blocks of a matrix. The convergence results show that overall the resulting 
block ordering may be less suitable for block factorization. 

3 The VBARMS method 


The VBARMS method discussed in this paper incorporates compression techniques to maximize 
computational efficiency during the factorization. We recall briefly below the main steps of the algorithm 
and we point the reader to [8] for further details. After permuting the coefficient matrix A in block form 
as 



' An 

A\2 ' 

' Alp 


A » P B APl = 

A 2 i 

41-22 

A-2 p 

, (3) 


1 

■jK 

A p 2 ■ 

i_ 
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Name 

Size 

Application 

nnz(A) 

symmetry 

OILPAN 

73752 

Structural problem 

2148558 

symmetric value 

K3PLATES 

11107 

FE stiffness matrix 

378927 

symmetric value 

VENKAT01 

62424 

Unstructured 2D Euler solver 

1717792 

symmetric structure 

PWTK 

217918 

Pressurized wind tunnel 

11524432 

symmetric value 

S3DKQ4M2 

90449 

Structural mechanics 

2455670 

symmetric value 

OLAFU 

16146 

Structural problem 

1015156 

symmetric value 

RAE 

52995 

Turbulence analysis 

1748266 

symmetric structure 

BMW7ST_1 

141347 

Stiffness matrix 

7318399 

symmetric value 

NASASRB 

54870 

Shuttle rocket booster 

2677324 

symmetric value 

CT20STIF 

52329 

Stiffness matrix engine block 

2600295 

symmetric value 

RAEFSKY3 

21200 

Fluid structure interaction turbulence problem 

1488768 

symmetric structure 

HEART1 

3557 

Quasi-static FEM of a heart 

1385317 

symmetric structure 

BCSSTK35 

30237 

Automobile seat frame 

1450163 

symmetric value 

STACOM 

8415 

Compressible flow 

271936 

symmetric structure 


Table 2: Set and characteristics of test matrix problems. 


Matrix 

avJ)d 

avJbs 

Total 
time (s) 

Mem 

Its 

STACOM 

66.54 

2.38 

6.22 

11.02 

152 

K3PLATES 

83.51 

2.00 

8.94 

5.54 

329 

OLAFU 

89.60 

2.00 

7.66 

3.89 

84 

RAE 

68.28 

2.34 

412.89 

26.75 

1000 


Table 4: Pablo performance and Pablo with VBILUT. Block density refers to the average block density 
of the block ordering, Block size is the average block size, Total time includes the preconditioning 
construction and the solving time, Mem is the ratio between the number of nonzeros in the preconditioner 
and in the matrix. 
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(a) Using the PABLO algorithm 



(b) Using the graph algorithm 


Figure 1: Block patterns computed by different compression methods for the STACOM problem. 


where the diagonal blocks A ti , i = 1,... ,p are n,; x rq, the off-diagonal blocks A,j are rii x rij, and Pb is 
the permutation matrix of the block ordering computed by the compression algorithm, we can represent 


the adjacency graph of A by the quotient graph of A + A T 10 , which is smaller. Let B the partition 
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Algorithm 1 The graph based compression algorithm. 

1: Compute the keys hi = chk(i) for all vertices i £ V = {1,..., n} 

2 : Set processed nodes pi = 0 Vi = 1,..., n 
3: Make a set of supernodes V = 0 

4: Set s to the indices V sorted by the corresponding value in k 
5: for i = si,..., s n do 
6: if pi ^ 1 then 

7: Add a new supernode Yi to V 

8: for J = .s l+ i..... s n do 

9: if ki kj then 

10: break 

11: if adj(i ) = adj(j) then 

12: Add node j to Yi 

13: Set pj = 1 

14: Make a map M : i {Z £ V| i £ adj(Z)} 

15: for A' € V do 

16: for Z £ Uig.Y M(i) do 

17: Update the average block density value av.bd after merging A' and Z 

18: if avJbd > p then 

19: X = XUZ 

20 : V = V\Z 


into blocks given by ^ . The quotient graph Q/B = (Vs, Eg) is constructed by coalescing the vertices of 
each block A; n , for i = 1 ,,p, into one supervertex (or supernode) Y t . We can write 


V B = {Yi,..., Y p } , E b = {(Yi, Yj) | 3v £ Y u w £ Yj s.t. (v, w) £ E} . 

where (V,E) is the graph of A + A T . An edge connects two supervertices Y, and Yj if there exists an 

edge of ( V, E) connecting a vertex of the block An to a vertex of the block A n . 

The complete pre-processing and factorization process of VBARMS consists of the following steps. 

Step 1 Using the angle-based or the graph-based compression algorithms described in Section[2j compute 
a block ordering P B of A such that, after permutation, the matrix P B APg has fairly dense nonzero 
blocks. 

Step 2 Scale the matrix permuted at Step 1 as S\P B AP B S^, where Si and S 2 are two diagonal matrices 
such that the 1-norm of the largest entry in each row and column becomes smaller or equal than 
one. 


Step 3 Apply the block independent sets (or the nested dissection) algorithms to the quotient graph 
Q/B and compute an independet sets ordering Pj of Q/B. Upon permutation by Pj, the matrix 
obtained at Step 2 will write as 


P I S l P B AP] 3 S 2 Pj =^C 


(4) 


We use a simple weighted greedy algorithm for computing the ordering Pj 
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In the 2x2 partitioning Q, the upper left-most matrix D £ R mxm is block diagonal like in 
ARMS. However, due to the block permutation (Step 1), the diagonal blocks ZA of D are block 
sparse matrices while in ARMS they are sparse unstructured. The matrices F £ R mx ( rl - m ) i E £ 
l(" _m )x m , C £ are a j g0 bi oc ]j sparse, because of the same reason. 







Step 4 Factorize the matrix in Q as 


( D F \ _ ( L 0 \ ( U L~ 1 F \ 

y e c ) ~y eu- 1 I ) X ^ 0 A- : J ’ 

where I is the identity matrix of appropriate size, and 

A 1= C - ED~ 1 F. 


(5) 

( 6 ) 


is the Schur complement corresponding to C. Observe that the Schur complement is also block 
sparse and it has the same block structure as matrix C. 


Steps 2-4 can be repeated on the reduced system a few times until the Schur complement is small 
enough. Denoting by An the reduced Schur complement matrix at level l, for t > 1, after scaling and 
preordering A( a system with coefficient matrix 


Pf> D[ e) A l D ( £\P < f ) j T = 


De F( 

E e Cn 


U 0\ ( U t LJ 1 F t \ 

EnUf 1 I J \ 0 At +1 J 


(7) 


needs to be solved, with D e £ F t £ I ro ' x ("'- ra '), E e £ R (n <- ra ') xm ', C e £ 

and 

A e+ 1 =C e - EeD^Ft £ ^(ne-m e )x(u e - mt )^ ( 8 ) 

Calling 

x«=f b,= ( f ‘ 

V z l ) \ 9l 

the unknown solution vector and the right-hand side vector of system ([7|, respectively, the solution 
process with the above multilevel VBARMS factorization consists of a level-by-level forward elimination 
step followed by an exact solution on the last reduced subsystem and a suitable inverse permutation. The 
complete solving phase is sketched in Algorithm [2] 


Algorithm 2 VBARMS_Solve (A^ + i, bt ). The solving phase with the VBARMS method. 

Require: l £ N*, i m ax £ N*, be = (fi,ge) T 
1: Solve L t y = f e 
2: Compute g' t =ge~ E e U^ 1 y 
3: if l = Imax then 
4: Solve A^ +1 z^ = g' e 

5: else 

6: Call VBARMS_Solve CA £+1 , g ' £ ) 

7: Solve Ueye = ^y — LJ 1 F t z e ^ 


In VBARMS we perform the factorization approximately for memory efficiency. We use block ILU 
factorization with threshold to invert inexactly both the upper left-most matrix Dn ~ LfUf. at each 
level £, and the last level Schur complement matrix An ma:c « LsUs- The block ILU method used in 
VBARMS is a straightforward block variant of the one-level pointwise ILUT algorithm. We drop small 
blocks B £ W mBXnB in L(, Ue, Ls, Us whenever < t, for a given user-defined threshold t. The 

block pivots in block ILU are inverted exactly by using Gaussian Elimination with partial pivoting. Every 
operation performed during the factorization calls optimized level-3 BLAS routines [9 , taking advantage 
of the finest block structure appearing in the matrices Dn, Fn, Ee, Ce. Recall that this fine-level block 
structure results from the block ordering Pg and consists of small, usually dense, blocks in the diagonal 
blocks of De as well as in the matrices Ee, Fe, Ce- We do not drop entries in the construction of the 
Schur complement except at the last level. The same threshold is applied in all these operations. 
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Matrix 

Compression 

Method 

Factorization 

Solving 

Total 

Mem 

Its 

time (s) 

time (s) 

time (s) 


Bsize = 18.62 

VBARMS 

0.12 

0.43 

0.55 

0.83 

147 

HEART1 

Bdensity = 

ILUT 

- 

- 

- 

- 

- 

98.81 

VBILUT 

- 

- 

- 

- 

- 


t = 0.9 

ARMS 

- 

- 

- 

- 

- 


Bsize = 56.95 

VBARMS 

12.71 

25.02 

37.73 

4.42 

144 

PWTK 

Bdensity = 

ILUT 

- 

- 

- 

- 

- 

12.17 

VBILUT 

- 

- 

- 

- 

- 


II 

o 

bi 

ARMS 

- 

- 

- 

- 

- 


Bsize = 4.67 

VBARMS 

1.45 

1.28 

2.72 

2.46 

34 

RAE 

Bdensity = 

ILUT 

- 

- 

- 

- 

- 

95.83 

VBILUT 

- 

- 

- 

- 

- 


t = 0.8 

ARMS 

- 

- 

- 

- 

- 


Bsize = 9.18 

VBARMS 

2.56 

3.68 

6.23 

3.86 

76 

NASASRB 

Bdensity = 

VBILUT 

1.5 

23.02 

24.52 

4.58 

464 

47.35 

ILUT 

- 

- 

- 

- 

- 


t = 0.6 

ARMS 

- 

- 

- 

- 

- 


Bsize = 7.01 

VBARMS 

0.77 

1.63 

2.39 

2.57 

42 

OILPAN 

Bdensity = 

ILUT 

0.06 

32.02 

32.08 

0.02 

952 

99.94 

VBILUT 

- 

- 

- 

- 

- 


t = 0.8 

ARMS 

- 

- 

- 

- 

- 


Bsize = 11.03 

VBILUT 

0.09 

2.95 

3.03 

1.08 

243 

BCSSTK35 

Bdensity = 

VBARMS 

0.15 

3.22 

3.36 

0.95 

242 

51.95 

ILUT 

- 

- 

- 

- 

- 


r = 0.6 

ARMS 

- 

- 

- 

- 

- 


Table 5: Assessment performance of VBARMS against other popular preconditioning methods. The 
symbol means that no convergence is achieved after 1000 iterations of GMRES. 
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Algorithm 3 General ILU Factorization, IKJ Version. 

Require: A nonzero pattern set V 
l: for i = 2,..., n do 
2: for k = 1,..., i — 1 do 

3: if (*, j) ^ V then 

4 : Oik — 0,ik/(lkk 

5: for j = k + 1,..., n do 

6: if ( i,j ) ^ V then 

7 : (L^j 0>ij OikOkj 


3.1 The new implementation of VBARMS 

The code for the VBARMS method is developed in the C language and is adapted from the existing ARMS 
code available in the ITSOL package [l3]. The compressed sparse storage format of ARMS is modified to 
store block vectors and block matrices of variable size as a collection of contiguous nonzero dense blocks 
(we refer to this data storage format as VBCSR). However, the implementation used in this paper is 
different and noticeably faster than the one described in |8j. In the old implementation, the approximate 
transformation matrices EgUff 1 and Lj 1 Fi appearing in Eqn ([t]) at step I were explicitly computed 
and temporarily stored in the VBCSR format. They were discarded from the memory immediately 
after assembling Ag +1 . In the new implementation, we first compute the factors Lg, Ug and Lf x Fg by 
performing a variant of the IKJ version of the Gaussian Elimination algorithm (Algorithm [3]), where 
index I runs from 2 to mg, index I\ from 1 to (I — 1) and index J from ( K + 1) to ng. This loop 
applies implicitly Lf 1 to the block row [Dg , Fg ] to produce [Ug , LgFg] . In the second loop, Gaussian 
Elimination is performed on the block row [Eg , Cg] using the multipliers computed in the first loop to 
give EgU g and an approximation of the Schur complement Ag + \. We explicitly permute the matrix 
after Step 1 at the first level as well as the matrices involved in the factorization at each new reordering 
step. The improvement of efficiency obtained with the new implementation is noticeable, as appears from 
the results shown in Table [6j Finally, in Table [5] we assess the performance of the VBARMS method 
against other popular preconditioning techniques; we report on the number of GMRES iterations required 
to reduce the initial residual by 6 orders of magnitude using a block incomplete LU factorization as a 
preconditioner for GMRES. The results show a remarkable robustness for low to moderate memory cost. 

4 Using VBARMS in parallel computing 


In the experiments reported in this section the VBARMS method is used for solving large linear systems on 
distributed memory computers; its overall performance are assessed against the parallel implementation 
of the ARMS solver provided in the pARMS package 14 . On multicore machines, the quotient graph 


G/B is split into distinct subdomains using a parallel graph partitioner, and each of them is assigned to a 
different core. We follow the parallel framework described in 14 which separates the nodes assigned to 


the ith subdomain into interior nodes , that are those coupled only with local variables by the equations, 
and interface nodes, those that may be coupled with local variables stored on processor i as well as with 
remote variables stored on other processors (see Figure [2]). The vector of the local unknowns Xi and the 
local right-hand side bi are split accordingly in two separate components: the subvector corresponding to 
the internal nodes followed by the subvector of the local interface variables 


Xi = 


Ui 

Vi 


bi = 


fi 

9i 
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Matrix 

Implementation 

Factorization 
time (s) 

Solving 
time (s) 

Total 
time (s) 

Mem 

Its 

HEART1 

New 

0.12 

0.43 

0.55 

0.83 

147 

Old 

0.36 

0.33 

0.69 

0.86 

113 

PWTK 

New 

12.71 

25.02 

37.73 

4.42 

144 

Old 

90.73 

26.08 

116.81 

4.95 

140 

RAE 

New 

1.45 

1.28 

2.72 

2.46 

34 

Old 

5.12 

1.15 

6.27 

2.71 

30 

NASASRB 

New 

2.56 

3.68 

6.23 

3.86 

76 

Old 

15.54 

3.34 

18.88 

4.06 

64 

OILPAN 

New 

0.77 

1.63 

2.39 

2.57 

42 

Old 

5.64 

1.29 

6.93 

2.62 

32 

BCSSTK35 

New 

0.15 

3.22 

3.36 

0.95 

242 


Old 

- 

- 

- 

- 

- 


Table 6: Comparative experiments with the old and the new VBARMS codes, implementing a different 
partial (block) factorization step. The symbol means that no convergence is achieved after 1000 
iterations of GMRES. 



Figure 2: Local domain from a physical viewpoint. 


The rows of A corresponding to the nodes belonging to the ith subdomain are assigned to the ith processor. 
They are naturally separated into a local matrix A t acting on the local variables Xi = ( Ui,yi) T , and an 
interface matrix Ui acting on the remotely stored subvectors of the external interface variables ext . 
Hence we can write the local equations on processor i as 


A.iXi + Ui 

,extVi,ext 


or, in expanded form, as 


f B i Fi \ ( m 

\ Ei Ct ) \ y i 


0 

11 U X, B,jHi 



(9) 


where iV, is the set of subdomains that are neighbors to subdomain i and the submatrix Eijyj accounts 
for the contribution to the local equation from the jth neighboring subdomain. Notice that matrices J3,;, 


12 







Ci, Ei and Fi still preserve the finest block structure imposed by the block ordering Pb . At this stage, 
the VBARMS method described in Section [3] can be used as a local solver for different types of global 
preconditioners. 

In the simplest parallel implementation, the so-called block-Jacobi preconditioner, the sequential 
VBARMS method can be applied to invert approximately each local matrix A\. The standard Jacobi 
iteration for solving Ax = b is defined as 

x n+1 =x n + D~ x (b - Ax n ) = D~ x ( Nx n + b) 

where D is the diagonal of A, N = D — A and Xg is some initial approximation. In cases we have a graph 
partitioned matrix, the matrix D is block diagonal and the diagonal blocks of D are the local matrices Ai . 
The interest to consider this basic approach is its inherent parallelism, since the solves with the matrices 
A, are performed independently on all the processors and no communication is required. 

If the diagonal blocks of the matrix D are enlarged in the block-Jacobi method so that they overlap slightly, 
the resulting preconditioner is called Schwarz preconditioner. Consider again a graph partitioned matrix 
with N nonoverlapping sets Wf, i = 1,..., N and Wg = [J i=] Wg. We define a J-overlap partition 

N 

w s = (J wf 

i=1 

where Wf = adj (W/^ 1 ) and <5 > 0 is the level of overlap with the neighbouring domains. For each 
subdomain, we define a restriction operator Rf, which is an n x n matrix with the (j,j )th element equal 
to 1 if j € Wf , and zero elsewhere. We then denote 

Ai = RfARf. 

The global preconditioning matrix Mb as is defined as 

S 

M r \s = Y. R - A > Ar t 

i= 1 

and named as the Restricted Additive Schwarz preconditioner (RAS) [16 
preconditioning step is still parallel, as the different components of the error update are formed 
independently. However, some communication is required in the final update, as the components are 
added up from each subdomain due to overlapping. 

A third global preconditioner that we consider in this study is based on the Schur complement approach. 
In Eqn we can eliminate the vector of interior unknowns zq from the first equations to compute the 
local Schur complement system 

SiVi + ^2 E ijVj = 9i ^ = 9ii 

j£Ni 

where Si denotes the local Schur complement matrix 

Si = Ci — E i B~ 1 F i . 

The local Schur complement equations considered altogether write as the global Schur complement system 


Co 

to 


{ yi \ 


f 9 'i ) 

E 21 S 2 ... E 2p 


yi 


92 

\ Epi Ep— 1^2 • • • Sp j 


\Vp / 


\9'J 


20 


Note that the 
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where the off-diagonal matrices Eij are available from the parallel distribution of the linear system. One 
preconditioning step with the Schur complement preconditioner consists in solving approximately the 


global system (10), and then recovering the u t variables from the local equations as 

«i = Br'lfi ~ F iVi \ 


( 11 ) 


at the cost of one local solve. We solve the global system (10) by running a few steps of the GMRES 


method preconditioned by a block diagonal matrix, where the diagonal blocks are the local Schur 
complements 5). The factorization 

Si = L s . Usi 

is obtained as by-product of the LU factorization of the local matrix Ai, 

L Bi 


A, = 


EiU B 


-l 


0 

L Si 


U Bi L^Fi 


0 


Bi 

U s , 


which is by the way required to compute the u t variables in (11). 


4.1 Experiments 

Some preliminary results with a parallel MPI-based implementation of VBARMS for distributed memory 
computers, reported in a conference contribution 17 , revealed promising performance against the parallel 
ARMS method and the conventional ILUT method. They showed that exposing dense matrix blocks 
during the factorization may lead to more efficient and more stable parallel solvers. The parallel 
implementation of VBARMS considered in this study differs from the one presented in |7| in one important 
aspect. In the old implementation we used a sequential graph partitioner, namely the recursive dissection 
partitioner from the METIS package 12 , to split the quotient graph G/B and then assign the computed 


partitions to different processors. In the new implementation, the quotient graph is initially distributed 
amongst the available processors; then, the built-in parallel hypergraph partitioner available in the Zoltan 
package [2] is applied on the distributed data structure to compute an optimal partitioning of the quotient 
graph that can minimize the amount of communications. 

In the experiments reported in Table [H] we notice the significant reduction of CPU time spent for the 
graph partitioning operation in the new implementation of VBARMS; note that the numerical efficiency 
of the solvers is generally well preserved. The matrix problems used are listed in Table [TJ The parallel 
experiments were run on the large-memory nodes (32 cores/node and 1TB of memory) of the TACC 
Stampede system located at the University of Texas at Austin. TACC Stampede is a 10 PFLOPS (PF) 
Dell Linux Cluster based on 6,400+ Dell PowerEdge server nodes, each outfitted with 2 Intel Xeon E5 
(Sandy Bridge) processors and an Intel Xeon Phi Coprocessor (MIC Architecture). We linked the vendor 
BLAS library on Stampede, which has BLAS via MKL loaded by default and is multi-threaded. We used 
the Flexible GMRES (FGMRES) method 17 as Krylov subspace method, a tolerance of l.Oe — 6 in the 


stopping criterion and a maximum number of iteration equal to 1000. Memory costs were calculated as 
the ratio between the sum of the number of nonzeros in the local preconditioners, and the sum of the 
number of nonzeros in the local matrices A,. Overall, the Restricted Additive Schwarz solver showed 
better performance against the Block Jacobi and the Schur-complement methods. 
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Name 

Size 

Application 

nnz(A) 

AUDIKWA 

943695 

Structural problem 

77651847 

LDOOR 

952203 

Structural problem 

42493817 

STA004 

891815 

Fluid Dynamics 

55902989 

STA004 

891815 

Fluid Dynamics 

55902989 


Table 7: Set and characteristics of test matrix problems. 


Matrix 

Method 

Graph 

Graph 

Factorization 

Solving 

Total 

Its 

Mem 

type (s) 

time (s) 

time (s) 

time (s) 

time (s) 


BJ+VBARMS 

METIS (seq.) 

54.5 

18.88 

51.35 

70.23 

136 

3.13 


Zoltan (par.) 

5.2 

17.28 

37.98 

55.26 

117 

2.74 

AUDIKW_1 

RAS+VBARMS 

METIS (seq.) 

54.2 

19.54 

26.68 

46.22 

46 

2.93 

Zoltan (par.) 

5.3 

22.75 

22.24 

44.99 

52 

2.87 


SCHUR+VBARMS 

METIS (seq.) 
Zoltan (par.) 

54.4 

5.3 

82.72 

166.09 

295.11 

327.06 

377.83 

493.15 

69 

59 

6.21 

4.60 


BJ+VBARMS 

METIS (seq.) 

30.0 

1.29 

25.10 

26.40 

345 

1.95 


Zoltan (par.) 

1.1 

1.04 

18.09 

19.12 

273 

1.95 

LDOOR 

RAS+VBARMS 

METIS (seq.) 

29.0 

1.56 

13.40 

14.95 

200 

2.00 

Zoltan (par.) 

1.1 

1.12 

12.73 

13.85 

196 

1.99 


SCHUR+VBARMS 

METIS (seq.) 

29.0 

5.81 

16.75 

22.56 

54 

3.63 


Zoltan (par.) 

1.1 

5.64 

4.78 

10.42 

37 

3.32 


BJ+VBARMS 

METIS (seq.) 
Zoltan (par.) 

79.4 

2.5 

7.53 

5.11 

42.56 

24.12 

50.08 

29.23 

90 

72 

3.61 

3.61 

STA004 

RAS+VBARMS 

METIS (seq.) 

81.7 

9.55 

34.27 

43.82 

42 

3.85 

Zoltan (par.) 

2.6 

7.90 

23.09 

30.99 

34 

3.31 


SCHUR+VBARMS 

METIS (seq.) 

81.4 

17.46 

135.58 

153.04 

90 

5.29 


Zoltan (par.) 

2.5 

16.05 

113.24 

129.28 

88 

5.40 


BJ+VBARMS 

METIS (seq.) 
Zoltan (par.) 

81.9 

2.3 

11.36 

9.45 

85.77 

50.17 

97.14 

59.62 

227 

170 

4.77 

4.78 

STA008 

RAS+VBARMS 

METIS (seq.) 
Zoltan (par.) 

81.8 

2.4 

15.01 

12.90 

67.98 

46.52 

82.99 

59.42 

101 

97 

5.10 

5.07 


SCHUR+VBARMS 

METIS (seq.) 
Zoltan (par.) 

81.2 

2.4 

56.20 

66.42 

564.75 

490.25 

620.94 

556.67 

188 

201 

8.94 

9.83 


Table 8: Performance comparison of serial and parallel graph partition on 16 processors. Notation: P-N 
means number of processors, G-Type means graph partitioning strategy, G-time means partitioning 
timing cost, P-T means preconditioning construction time, I-T iterative solution time, Mem means 
memory costs. 


4.2 A case study in large-scale turbulent flows analysis 

We finally get back to the starting point that motivated this study. In this section we present a 
performance analysis with the parallel VBARMS implementation for solving large block structured linear 
systems arising from an implicit Newton-Krylov formulation of the Reynolds Averaged Navier Stokes 
(briefly, RANS) equations. Although explicit multigrid techniques have dominated the Computational 
Fluid Dynamics (CFD) arena for a long time, implicit methods based on Newton’s rootfinding algorithm 
are recently receiving increasing attention because of their potential to converge in a very small number of 
iterations. One of the most recent outstanding examples on the use of implicit unstructured RANS CFD 
is provided in the article 25 , which reports the turbulent analysis of the flow past three-dimensional 


wings using a vertex-based unstructured Newton-Krylov solvers. Practical implicit CFD solvers need to 
be combined with ad-hoc preconditioners to invert efficiently the large nonsymmetric linear system at 
each step of Newton’s algorithm. 
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Throughout this section we use standard notation for the kinematic and thermodynamic variables: we 
denote by u the flow velocity, by p the density, p is the pressure, T is the temperature, e and h are 
respectively the specific total energy and enthalpy, v is the laminar kinematic viscosity and D is a scalar 
variable related to the turbulent eddy viscosity via a damping function. The quantity a denotes the sound 
speed or the square root of the artificial compressibility constant in case of the compressible, respectively 
incompressible, flow equations. In the case of high Reynolds number flows, we account for turbulence 
effects by the RANS equations that are obtained from the Navier-Stokes (NS) equations by means of a 
time averaging procedure. The RANS equations have the same structure as the NS equations with an 
additional term, the Reynolds’ stress tensor, that accounts for the effects of the turbulent scales on the 
mean field. Using Boussinesq’s approximation, the Reynolds’ stress tensor is linked to the mean velocity 
gradient through the turbulent (or eddy) viscosity. In our study, the turbulent viscosity is modeled using 
the Spalart-Allmaras one-equation model |22| . The physical domain is partitioned into nonoverlapping 
control volumes drawn around each gridpoint by joining, in two space dimensions, the centroids of gravity 
of the surrounding cells with the midpoints of all the edges that connect that gridpoint with its nearest 
neighbors, as shown in Figure [3] 



(a) The flux balance of cell T is scattered among 
its vertices. 



(b) Gridpoint i gathers the fractions of cell 
residuals from the surrounding cells. 


Figure 3: Residual distribution concept. 


Given a control volume Ci, fixed in space and bounded by the control surface dCi with inward normal ft, 
we write the governing conservation laws of mass, momentum, energy and turbulence transport equations 
as 




' ■ F dS — 


dCi 


ft ■ GdS + 


\dV, 


dCi 


>Ci 


( 12 ) 


where we denote by q the vector of conserved variables. For compressible flows, we have q = (p, pe, pu, D) , 
and for incompressible, constant density flows, q = ( p,u,D ) . In (121, the vector operators F and G 
represent the inviscid and viscous fluxes, respectively. For compressible flows, we have 


F = 


( pu \ 


° i 

puh 

, G~ 1 

u ■ t + Vg 

puu + pi 

’ Reoc 

T 

\ Du ) 


V ^[O' + ^V*] / 

nt density flows, 


/ a 2 u \ 

i / 

0 \ 

uu + pi 

, G= — 

T ) 

\ Du J 


, £ [(!/+*) V*] J 
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where r is the Newtonian stress tensor. The source term vector s has a non-zero entry only in the row 
corresponding to the turbulence transport equation, which takes the form 


Cbl [1 - ft 2 \ Sv + 


1 


aRe . 


q ,2 (voy 


i 

Re 


f Cbl f 

Cwljw n Jt2 



V 


d 


(13) 


For a description of the various functions and constants involved in (13) we refer the reader to 22 


We consider a fluctuation splitting approach to discretize in space the integral form of the governing 
equations (12) over each control volume C). The flux integral is evaluated over each triangle (or 
tetrahedron) in the mesh, and then split among its vertices [H] (see Figure [3]), so that we may write 
from Eq. 

X 

-'G, T3i 


where 


Tt 


n ■ F dS — 


dT 


n-GdS - 


sdV 


dT 


is the flux balance evaluated over cell T and <p[ is the fraction of cell residual scattered to vertex i. Upon 
discretization of the governing equations, we obtain a system of ordinary differential equations of the 
form 




(14) 


where t denotes the pseudo time variable, M is the mass matrix and f(q) represents the nodal residual 
vector of spatial discretization operator, which vanishes at steady state. The residual vector is a (block) 
array of dimension equal to the number of meshpoints times the number of dependent variables, ?n; for a 
one-equation turbulence model, m = d + 3 for compressible flows and m = d + 2 for incompressible flows, 


d being the spatial dimension. If the time derivative in equation (14) is approximated using a two-point 


one-sided finite difference (FD) formula we obtain the following implicit scheme: 




(15) 


dr 


We use a finite difference approximation of 
the Jacobian, where the individual entries of the vector of nodal unknowns are perturbed by a small 


where we denote by J the Jacobian of the residual —-. 

dq 


amount e and the nodal residual is then recomputed for the perturbed state. Eq. (151 represents a large 


nonsymmetric sparse linear system of equations to be solved at each pseudo-time step for the update of 
the vector of the conserved variables. The nonzero pattern of the sparse coefficient matrix is symmetric; 
on average, the number of non-zero (block) entries per row in our discretization scheme equals 7 in 2D 
and 14 in 3D. Choice of the iterative solver and of the preconditioner can have a strong influence on 
computational efficiency, especially when the mean flow and turbulence transport equations are solved in 
fully coupled form like we do. 

We consider turbulent incompressible flow analysis past a three-dimensional wing illustrated in Fig. [3] 


26 


The geometry, called DPW3 Wing-1, was proposed in the 3rd AIAA Drag Prediction Workshop 
Flow conditions are 0.5° angle of attack and Reynolds number based on the reference chord equal to 
5 • 10 6 . The freestream turbulent viscosity is set to 10% of its laminar value. 

In Table [9] we show experiments with parallel VBARMS on the first four meshes of the DPW3 Wing-1 
problem. We illustrate only examples with the parallel graph partitioning strategy described in Section [3] 
In Table [To] we report on only one experiment on the largest mesh, as this is a resource demanding 
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Ref. Area, 

S = 290322 

mm 2 

= 450 in 2 

Ref. Chord, c = 197.556 

mm 

= 7.778 in 

Ref. Span, 

b = 1524 mm 

= 60 in 

RANSl 


n = 4918165 

nnz 

= 318370485 

RANS2 


n = 4918165 

nnz 

= 318370485 

RANS3 


n = 9032110 

nnz 

= 670075950 

RANS4 


n = 12085410 

nnz 

= 893964000 

RANS5 


n = 22384845 

nnz 

= 1659721325 


Figure 4: Geometry and mesh characteristics of the DPW3 Wing-1 problem proposed in the 3rd AIAA 
Drag Prediction Workshop. Note that problems RANSl and RANS2 correspond to the same mesh, and 
are generated at two different Newton steps. 


problem. In Table 11 we perform a strong scalability study on the problem denoted as RANS2 by 
increasing the number of processors. Finally, in Table [I2| we report on comparative results with parallel 
VBARMS against other popular solvers. The method denoted as pARMS is the solver described in 14 


using default parameters. The results of our experiments confirm the same trend of performance shown 
on general problems. The proposed VBARMS method is remarkably efficient for solving block structured 
linear systems arising in applications in combination with conventional parallel global solvers such as 
in particular the Restricted Additive Schwarz preconditioner. A truly parallel implementation of the 
VBARMS method that may offer better numerical scalability will be considered as the next step of this 
research. 


Matrix 

Method 

Graph 
time (s) 

Factorization 
time (s) 

Solving 
time (s) 

Total 
time (s) 

Its 

Mem 


BJ+VBARMS 

17.3 

8.58 

41.54 

50.13 

34 

2.98 

RANSl 

RAS+VBARMS 

17.4 

10.08 

42.28 

52.37 

19 

3.06 


SCHUR+VBARMS 

17.6 

11.94 

55.99 

67.93 

35 

2.57 


BJ+VBARMS 

17.0 

16.72 

70.14 

86.86 

47 

4.35 

RANS2 

RAS+VBARMS 

16.8 

21.65 

80.24 

101.89 

39 

4.49 


SCHUR+VBARMS 

17.5 

168.85 

173.54 

342.39 

24 

6.47 


BJ+VBARMS 

27.2 

99.41 

187.95 

287.36 

154 

4.40 

RANS3 

RAS+VBARMS 

25.2 

119.32 

90.47 

209.79 

71 

4.48 


SCHUR+VBARMS 

22.0 

52.65 

721.67 

774.31 

140 

4.39 


BJ+VBARMS 

51.5 

12.05 

105.89 

117.94 

223 

3.91 

RANS4 

RAS+VBARMS 

43.9 

14.05 

91.53 

105.58 

143 

4.12 


SCHUR+VBARMS 

39.3 

15.14 

289.89 

305.03 

179 

3.76 


Table 9: Experiments on the DPW3 Wing-1 problem. The RANSl, RANS2 and RANS3 test cases are 
solved on 32 processors, whereas the RANS4 problem on 128 processors. 


Matrix 

Method 

Total time (s) 

Its 

Mem 

RANS5 

RAS+VBARMS 

291.42 

235 

4.05 


Table 10: RANS5 problem is solved on 128 processors. 
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Solver 

Number of 
processors 

Graph 
time (s) 

Total 
time (s) 

Its 

Mem 


8 

38.9 

388.37 

27 

5.70 


16 

28.0 

219.48 

35 

5.22 

RAS+VBARMS 

32 

17.0 

101.49 

39 

4.49 


64 

16.0 

54.19 

47 

3.91 


128 

18.2 

28.59 

55 

3.39 


Table 11: Strong scalability study on the RANS2 problem using parallel graph partitioning. 


Matrix 

Method 

Factorization 
time (s) 

Solving 
time (s) 

Total 
time (s) 

Its 

Mem 


pARMS 

- 

- 

- 

- 

6.63 

RANS3 

BJ+VBARMS 

99.41 

187.95 

287.36 

154 

4.40 


BJ+VBILUT 

20.45 

8997.82 

9018.27 

979 

13.81 


pARMS 

- 

- 

- 

- 

5.38 

RANS4 

BJ+VBARMS 

12.05 

105.89 

117.94 

223 

3.91 


BJ+VBILUT 

1.16 

295.20 

296.35 

472 

5.26 


Table 12: Experiments on the DPW3 Wing-1 problem. The RANS3 test case is solved on 32 processors 
and the RANS4 problem on 128 processors. The dash symbol — in the table means that in the GMRES 
iteration the residual norm is very large and the program is aborted. 

5 Conclusions 

We have presented a parallel MPI-based implementation of a new variable block multilevel ILU 
factorization preconditioner for solving general nonsymmetric linear systems. One nice feature of the 
proposed solver is that it detects automatically exact or approximate dense structures in the coefficient 
matrix. It exploits this information to maximize computational efficiency. We have also introduced a 
modified compression algorithm that can find these approximate dense blocks structures, and requires 
only one simple to use parameter. The results show that the solver has nice parallel performance, also 
thanks to the use of a parallel graph partitioner, and it may be noticeably more robust than other 
state-of-the-art methods that do not exploit the fine-level block structure of the underlying matrix. 


6 Acknowledgements 

The work of M. Sosonkina was supported in part by the Air Force Office of Scientific Research under 
the AFOSR award FA9550-12-1-0476, and by the National Science Foundation grants NSF/OCI0941434, 
0904782, 1047772. The authors acknowledge the Texas Advanced Computing Center (TACC) at the 
University of Texas at Austin for providing HPC resources that have contributed to the research results 
reported in this paper. URL: http://www.tacc.utexas.edu. The authors are grateful to Sven Baars 
for his assistance in implementing some of the algorithms described in the paper, and to the reviewers 
for their insightful comments that helped much improve the presentation. 


References 

[1] C. Ashcraft. Compressed graphs and the minimum degree algorithm. SIAM J. Scientific Computing , 
16(6):1404-1411, 1995. 


19 


















[2] O. Axelsson and P. S. Vassilevski. Algebraic multilevel preconditioning methods, I. Numer. Math., 
56:157 177,1989. 

[3] O. Axelsson and P. S. Vassilevski. Algebraic multilevel preconditioning methods. II. SIAM J. Numer. 
Anal., 27:1569-1590, 1990. 

[4] Erik Boman, Karen Devine, Lee Ann Fisk, Robert Heaphy, Bruce Hendrickson, Vitus Leung, 
Courtenay Vaughan, Umit Catalyurek, Doruk Bozdag, and William Mitchell. Zoltan home page, 
http://www.cs.sandia.gov/Zoltan,1999. 

[5] A. Bonfiglioli. Fluctuation splitting schemes for the compressible and incompressible Euler and 
Navier-Stokes equations. IJCFD, 14:21-39, 2000. 

[6] E.F.F. Botta, A. van der Ploeg, and F.W. Wubs. Nested grids ILU-decomposition (NGILU). Journal 
of Computational and Applied Mathematics, 66:515-526, 1996. 

[7] B. Carpentieri, J. Liao, and M. Sosonkina. Parallel Processing and Applied Mathematics, volume 
8385 of Lecture Notes in Computer Science, chapter Variable block multilevel iterative solution of 
general sparse linear systems, pages 520-530. In R. Wyrzykowski, J. Dongarra, K. Karczewski, and 
J. Wasniewski. Springer-Verlag., 2014. 

[8] B. Carpentieri, J. Liao, and M. Sosonkina. VBARMS: A variable block algebraic recursive multilevel 
solver for sparse linear systems. Journal of Computational and Applied Mathematics, 259 (A): 164- 
173, 2014. 

[9] J.J. Dongarra, J. Du Croz, I. S. Duff, and S. Hammarling. A set of level 3 basic linear algebra 
subprograms. ACM Trans. Math. Softw., 16:1- 17, 1990. 

[10] A. George and J. W. Liu. Computer Solution of Large Sparse Positive Definite Systems. Prentice- 
Hall, Englewood Cliffs, New Jersey, 1981. 

[11] A. Gupta and T. George. Adaptive techniques for improving the performance of incomplete 
factorization preconditioning. SIAM J. Sci. Comput., 32(1):84-110, 2010. 

[12] G. Karypis and V. Kumar. Metis: A software package for partitioning unstructured graphs, 
partitioning meshes, and computing fill-reducing orderings of sparse matrices version 4.0. http: 
//glaros.dtc.umn.edu/gkhome/views/metis University of Minnesota, Department of Computer 
Science / Army HPC Research Center Minneapolis, MN 55455. 

[13] Na Li, B. Suchomel, D. Osei-Kuffuor, and Y. Saad. ITSOL: iterative solvers package. 

[14] Z. Li, Y. Saad, and M. Sosonkina. pARMS: a parallel version of the algebraic recursive multilevel 
solver. Numerical Linear Algebra with Applications, 10:485-509, 2003. 

[15] J. O’Neil and D.B. Szyld. A block ordering method for sparse matrices. SIAM J. Scientific and 
Statistical Computing, 11(5):811—823, 1990. 

[16] A. Quarteroni and A. Valli. Domain decomposition methods for partial differential equations. 
Clarendon Press Oxford, 1999. 

[17] Y. Saad. A flexible inner-outer preconditioned GMRES algorithm. SIAM J. Scientific and Statistical 
Computing, 14:461-469, 1993. 

[18] Y. Saad. ILUM: A multi-elimination ILU preconditioner for general sparse matrices. SIAM J. 
Scientific Computing, 17(4):830-847, 1996. 


20 



[19] Y. Saad. Finding exact and approximate block structures for ilu preconditioning. SIAM J. Sci. 
Compute 24(4): 1107-1123, 2002. 

[20] Y. Saad. Iterative Methods for Sparse Linear Systems. SIAM, 2nd edition, 2003. 

[21] Y. Saad and B. Suchomel. ARMS: An algebraic recursive multilevel solver for general sparse linear 
systems. Numerical Linear Algebra with Applications , 9(5):359-378, 2002. 

[22] P.R. Spalart and S.R. Allmaras. A one-equation turbulence model for aerodynamic flows. La 
Recherche-Aerospatiale, 1:5—21, 1994. 

[23] N. Vannieuwenhoven and K. Meerbergen. IMF: An incomplete multifrontal LU-factorization for 
element-structured sparse linear systems. SIAM J. Sci. Comput., 35(1):A270-A293, 2013. 

[24] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel. Optimization of sparse matrix- 
vector multiplication on emerging multicore platforms. In Proc. ACM/IEEE Conf. Supercomputing 
(SC), 2007. 

[25] P. Wong and D. Zingg. Three-dimensional aerodynamic computations on. unstructured grids using 
a newton-krylov approach. Computers & Fluids, 37:107 120, 2008. 

[26] Drag Prediction Workshop. URL:http://aaac.larc.nasa.gov/tsab/cfdlarc/aiaa-dpw/ 
Workshop3/workshop3.html 


21 


